Overview

Differences from deterministic modeling

A stochastic model incorporates randomness; given the same inputs, a stochastic model may generate different outputs on different runs.

This is useful when the process you want to model is fundamentally stochastic, or if there is heterogeneity in its dynamics.

Unfortunately stochastic models can get complicated, and come with a lot of notation and formalism. Our goal here is to quickly review basic probability theory so that we can use stochastic models productively.


Random variables

A random variable is a variable whose value is random. In this course, random variables will usually have some real-world meaning, like:

  • \(X\) is the value of a fair coin toss
  • \(X\) is the number of new HIV infections diagnosed this week
  • \(X\) is the number of people who had a drug overdose in the last 24 hours
  • \(X\) is the survival time of a subject living with cancer


Probability

Probability \(\Pr(\cdot)\) is a function whose argument is a value a random variable can take on, and which returns a number between 0 and 1. The set of all such values, together with their probabilities, describes the probability distribution.

For discrete random variables, we will write the probability mass function \(\Pr(X=x)\). For continuous random variables, we will use the density, defined below.

We usually describe random variables in terms of their probability distribution. For example, consider the random variable \(X\) defined by \(\Pr(X=0) = 1/2\) and \(\Pr(X=1)=1/2\). What might \(X\) refer to?


Cumulative probability distribution function

All random variables on the real line have a cumulative distribution function (CDF) \[ F(x) = \Pr(X \le x) \] The cumulative distribution function fully characterizes the probability distribution of \(X\).


Probability density

The density of a continuous random variable is \(f(x)\), the derivative of the CDF \(F(x)\) of \(X\), where it exists \[ f(x) = \lim_{h\to 0} \frac{F(x+h) - F(x)}{h} \]


Conditional probability

\(\Pr(Y=y|X=x)\) is the probability that \(Y=y\), given that \(X=x\). We treat the statement on the right-hand side of the conditioning sign \(|\) as known to have happened. The definition of conditional probability says how to compute it, as the ratio of the joint probability to the marginal probability:

\[ \Pr(Y=y|X=x) = \frac{\Pr(Y=y,X=x)}{\Pr(X=x)} \]

Example: \(X\sim\text{Binomial}(n,p)\) and \(0< k \le n\).

\[ \Pr(X=k|X>0) = \binom{n}{k} p^k (1-p)^{n-k} / (1-(1-p)^n) \]
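This follows directly from the definition of conditional probability: for \(k > 0\), the event \(\{X=k\}\) is contained in \(\{X>0\}\), so

\[ \Pr(X=k|X>0) = \frac{\Pr(X=k, X>0)}{\Pr(X>0)} = \frac{\Pr(X=k)}{1-\Pr(X=0)} = \frac{\binom{n}{k} p^k (1-p)^{n-k}}{1-(1-p)^n} \]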


Expectation

The expected value of a random variable is its “average” value.

For a discrete random variable,

\[ E[X] = \sum_x x \Pr(X=x) \]

and in general

\[ E[X] = \int_{-\infty}^\infty x \, dP(x) \]

That is, the average of the values \(x\) that \(X\) can take on, weighted by their probabilities.

Can a random variable have an expectation equal to a value that \(X\) can never take on? Yes. In the example above, \(\Pr(X=x)=1/2\) for \(x=0\) and \(x=1\), but \(E[X]=1/2\).


Conditional expectation

Conditional expectations are just expectations with respect to a conditional probability distribution. \[ E[Y|X=x] = \sum_y y\Pr(Y=y|X=x) \] \[ E[Y|X=x] = \int_{-\infty}^\infty y f(y|X=x) dy \]


Law of total probability

\[ \Pr(X=x) = \sum_y \Pr(X=x|Y=y) \Pr(Y=y) \] \[ \Pr(X=x) = \int \Pr(X=x|Y=y) dP(y) \]

Law of total expectation

\[ E[X] = \sum_y E[X|Y=y] \Pr(Y=y) \] \[ E[X] = \int E[X|Y=y] dP(y) \]
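As a numerical sanity check, here is the law of total expectation applied to a small made-up example (the group probabilities and conditional means below are hypothetical, not from the course):

```r
# Hypothetical two-group example: Y picks a group, X|Y=y is Binomial(10, p_y)
pY <- c(0.3, 0.7)               # Pr(Y=y)
condE <- c(10 * 0.5, 10 * 0.1)  # E[X|Y=y] = n * p_y
EX <- sum(condE * pY)           # E[X] = sum_y E[X|Y=y] Pr(Y=y)
EX                              # 2.2
```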


Independence

Two random variables are independent if their joint probability factorizes into the product of their marginal probabilities, \[ \Pr(X=x,Y=y) = \Pr(X=x) \Pr(Y=y) \] e.g. independent coin flips, die rolls, etc.


Discrete probability distributions

Bernoulli

This is the classic coin-flipping distribution.

The random variable \(X\) has Bernoulli\((p)\) distribution if \(\Pr(X=1)=p\) and \(\Pr(X=0)=1-p\).


Binomial

A Binomial\((n,p)\) random variable is a sum of \(n\) Bernoulli variables, each independent with probability \(p\). The probability mass function is \[ \Pr(X=k) = \binom{n}{k} p^k (1-p)^{n-k} \] where \(0 \le k \le n\). The probability mass function tells a nice story!

ks = 0:10
p = dbinom(ks,size=10,p=1/2)
plot(ks,p, pch=16, bty="n", main="Binomial probability mass function")


Geometric

A Geometric\((p)\) random variable is the number of independent Bernoulli\((p)\) trials needed to get the first success. The probability mass function is

\[ \Pr(X=k) = (1-p)^{k-1} p \] where \(k \ge 1\). The probability mass function tells a nice story!

ks = 0:20
# note: R's dgeom counts the number of failures before the first success,
# which is one less than the number of trials
p = dgeom(ks,p=1/3)
plot(ks,p, pch=16, bty="n", main="Geometric probability mass function")


Poisson

A common probability model for counts of events occurring at a rate \(\lambda\) over a time interval of length \(t\) is \[ \Pr(X = k) = \frac{(\lambda t)^k e^{-\lambda t}}{k!} \] for \(k = 0, 1, 2, \ldots\) This distribution looks weird, but it has a nice story, which we will see when we talk about Poisson processes.

ks = 0:20
lam = 5
p = dpois(ks,lam)
plot(ks,p, pch=16, bty="n", main="Poisson distribution")


Continuous probability distributions

Uniform

Usually defined on \([0,1]\). \(X \sim \text{Unif}(0,1)\) means the density is \[ f(x) = 1 \] for \(0\le x \le 1\).


Exponential

If \(X\) has Exponential\((\lambda)\) distribution, the density is

\[ f(x) = \lambda e^{-\lambda x} \]

for \(x \ge 0\), and the CDF is

\[ F(x) = 1-e^{-\lambda x} \]

Memoryless property: \[ \Pr(X>t+h| X>t) = \frac{\Pr(X>t+h)}{\Pr(X>t)} = \frac{e^{-\lambda(t+h)}}{e^{-\lambda t}} = e^{-\lambda h} = \Pr(X>h) \]
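We can check the memoryless property numerically with R's built-in exponential CDF `pexp` (the particular values of \(\lambda\), \(t\), and \(h\) below are arbitrary):

```r
# Pr(X > t+h | X > t) should equal Pr(X > h) for an exponential
lam <- 2; t <- 1; h <- 0.5
lhs <- (1 - pexp(t + h, rate = lam)) / (1 - pexp(t, rate = lam))
rhs <- 1 - pexp(h, rate = lam)
all.equal(lhs, rhs)   # TRUE
```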

xs = seq(0,10,by=0.01)
lam = 2
p = dexp(xs, rate=lam)
plot(xs, p, type="l", bty="n", main="Exponential probability density")


Normal

The normal\((\mu,\sigma^2)\) density is

\[ f(x) = \frac{1}{\sqrt{2\pi} \sigma} \exp\left[ -\frac{1}{2\sigma^2} (x-\mu)^2 \right] \]

Where does this functional form come from? It’s complicated.

xs = seq(-5,5,by=0.01)
p = dnorm(xs)
plot(xs, p, type="l", bty="n", main="Normal probability density")


Other probability distributions

There are lots of other “canonical” probability distributions

Discrete: Negative binomial, beta binomial, counting distributions

Continuous: Gamma, Lognormal, Laplace, survival/accelerated failure time distributions

We won’t use these in this class, but it’s important to understand that we have lots of options for modeling random quantities. This is partly what the field of statistics is about.


Decision trees

Decision trees as probability models.

Another important class of stochastic model we will see in this class is a decision tree.

A decision tree represents a sequence of decisions, actions, or states in a probability model. You can think of an individual jumping from left to right along the nodes in the tree. At each node, they follow one of the outgoing links with the given probability. This means their path through the tree is stochastic.


Computing with decision trees

Let’s look at a simple decision tree for the treatment of individuals diagnosed with a certain disease. Following diagnosis, each individual gets either treatment A or B, with certain probabilities. Then either they survive or die by some follow-up time, depending on which treatment they got. We can use the information in this tree to learn about the treatments and optimal courses of action.

Computing “marginal” and conditional probabilities is easy via law of total probability: \[ \Pr(\text{Alive}|\text{Treatment A}) = 0.7 \] \[ \Pr(\text{Alive}|\text{Treatment B}) = 0.1 \] \[ \Pr(\text{Alive}) = 0.4 \times 0.7 + 0.6 \times 0.1 = 0.34 \]
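The marginal probability above is easy to check in R using the branch probabilities from the tree:

```r
# Law of total probability over the two treatment branches
p_trt   <- c(A = 0.4, B = 0.6)   # Pr(Treatment)
p_alive <- c(A = 0.7, B = 0.1)   # Pr(Alive | Treatment)
total <- sum(p_trt * p_alive)    # Pr(Alive)
total                            # 0.34
```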


A note about decision trees

Here is a different tree for the same problem. Suppose we want to know whether treatment A or B is better.

The total probability of being alive is \[ \Pr(\text{Alive}) = 0.9 \times 0.6 + 0.1 \times 0.9 = 0.63\]

Let’s look at the alive people and see which treatment they got: \[ \Pr(\text{Treatment A}|\text{Alive}) = 0.9 \times 0.6 / 0.63 = 0.86 \] \[ \Pr(\text{Treatment B}|\text{Alive}) = 0.1 \times 0.9 / 0.63 = 0.14 \]
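The same computation in R, applying Bayes' theorem to the branch probabilities of this second tree:

```r
# Pr(Treatment | Alive) via Bayes' theorem
p_trt   <- c(A = 0.9, B = 0.1)   # Pr(Treatment)
p_alive <- c(A = 0.6, B = 0.9)   # Pr(Alive | Treatment)
p_joint <- p_trt * p_alive       # Pr(Treatment, Alive)
post <- p_joint / sum(p_joint)   # Pr(Treatment | Alive)
round(post, 2)                   # A: 0.86, B: 0.14
```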

It looks like Treatment A is best. Is that true?


Structural probability models

“Statistical models” are stochastic because they contain random variables. Consider the usual Normal/Gaussian linear model \[ y = \alpha + \beta x + \epsilon \] where \(\epsilon \sim N(0,\sigma^2)\). This model is stochastic because \(\epsilon\) is random.

In my opinion, you should think about so-called statistical models as mechanistic. Look at the relationships they imply, and evaluate whether they capture the relevant dynamics of the system under study.

n = 100
eps = rnorm(n)
x = runif(n)
y = 0.1 - 2*x + eps
plot(x,y, bty="n")
abline(a=0.1, b=-2)


Dynamic stochastic models

Survival models

Let \(T_i\) be the survival time of individual \(i\). A survival model is a model for \(T_i\), and is usually expressed in terms of the hazard of failure, \(\lambda_i(t)\). Formally,

\[ \lambda_i(t) = \lim_{h\to 0} \frac{\Pr(T_i \in (t,t+h)|T_i>t)}{h} \]

Since \(T_i\) is stochastic, this is a stochastic model.
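For example, if \(T_i\) has an Exponential\((\lambda)\) distribution, the hazard is constant: the limit works out to the density divided by the survival probability,

\[ \lambda_i(t) = \frac{f(t)}{\Pr(T_i>t)} = \frac{\lambda e^{-\lambda t}}{e^{-\lambda t}} = \lambda \]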


Markov models

In this course, we will focus on Markov stochastic models in discrete and continuous time. A Markov model is one whose next event depends only on its current state, and not on its prior history.

It is sometimes possible to work with non-Markov stochastic models, but it is often not worthwhile.


Discrete-time discrete-space models


Stochastic mechanistic models

We described deterministic systems in terms of their rates earlier. But a stochastic model produces different output each time we run it. How can we describe its rates of change when its path is different every time?

We need to describe the rate of change of its transition probabilities. Consider a continuous-time Markov chain \(X(t)\) on a discrete state space \(S\). First, define the transition rate between states \(i\) and \(j\) in \(S\) as \(\lambda_{ij}\). Then define the transition probabilities

\[ P_{ij}(t) = \Pr(X(s+t)=j|X(s)=i) \]

for \(i,j\in S\). For a time-homogeneous chain, this probability does not depend on \(s\).


Kolmogorov “forward” equations (optional)

To describe the dynamics of the system, we can write a system of differential equations,

\[ \frac{dP_{ij}(t)}{dt} = \sum_{k\neq j} \lambda_{kj} P_{ik}(t) - \lambda_j P_{ij}(t) \]

where \(\lambda_j = \sum_{k\neq j} \lambda_{jk}\). When \(S\) is finite, a matrix representation can be formed

\[ \mathbf{P}'(t) = \mathbf{P}(t) \boldsymbol{\Lambda} \]

with initial condition \(\mathbf{P}(0)=\mathbf{I}\). The solution is given by the matrix exponential

\[ \mathbf{P}(t) = \exp[\boldsymbol{\Lambda} t] \]
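As a sketch of what the matrix exponential solution means, here is a numerical check on a two-state chain (the rates, the time point, and the hand-rolled truncated-series `mexp` below are all illustrative assumptions; base R has no built-in matrix exponential):

```r
# Two-state chain: 1 -> 2 at rate lam, 2 -> 1 at rate mu
lam <- 1; mu <- 2; t <- 0.7
Lambda <- matrix(c(-lam,  lam,
                     mu,  -mu), nrow = 2, byrow = TRUE)

# Truncated power series exp(A) = I + A + A^2/2! + ...
mexp <- function(A, nterms = 30) {
  out <- diag(nrow(A)); term <- diag(nrow(A))
  for (k in 1:nterms) { term <- term %*% A / k; out <- out + term }
  out
}

P <- mexp(Lambda * t)
# Known closed form for the two-state chain:
P12 <- lam / (lam + mu) * (1 - exp(-(lam + mu) * t))
all.equal(P[1, 2], P12)   # TRUE
```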


Continuous-time Markov chains

Several properties of CTMCs are useful:

  1. Markov property
  2. Exponential waiting times to the next jump
  3. Independence of waiting time and destination
  4. Easy simulation

Drawbacks:

  1. Difficult to compute moments
  2. Difficult to compute transition probabilities
  3. Difficult to estimate parameters from discretely observed data

Homogeneous Poisson process

Now suppose, starting at time 0, we keep track of the number \(N_t\) of recurrent events that happen before time \(t\). This is called a counting process. Suppose each event \(i\) occurs after a waiting time

\[ W_i \sim \text{Exponential}(\lambda) \]

following the last event. Then \(N_t\) is a continuous-time Markov counting process.

What is the distribution of the number of events \(N_t\) that occur before \(t\)?

\[ N_t \sim \text{Poisson}(\lambda t) \]

So \(E[N_t] = \lambda t\). This is where the Poisson distribution comes from.
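A quick simulation sketch (the rate, horizon, and sample size are arbitrary choices): draw exponential waiting times, count how many events land before \(t\), and compare the average count to \(\lambda t\).

```r
# Count events of a rate-2 Poisson process before time 5; E[N_t] = 10
set.seed(1)
lam <- 2; tmax <- 5; nsim <- 10000
counts <- replicate(nsim, {
  w <- rexp(100, rate = lam)   # more waiting times than we will need
  sum(cumsum(w) < tmax)        # number of events before tmax
})
mean(counts)                   # close to lam * tmax = 10
```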


State-inhomogeneous Poisson process

Now consider a counting process whose waiting times follow

\[ W_i \sim \text{Exponential}(\lambda_i) \]

Many count-dependent counting processes are of interest. These specify a functional form for \(\lambda_i\) as a function of \(i\). Let’s look at some.


Pure birth process

Let \(X(t)\) be the number of individuals in a population, and let \(X(0)=x_0>0\) be the initial number. Suppose that when \(X(t)=k\), the population gains another individual with rate

\[ \lambda_k = k\lambda \]

The expected number of individuals at time \(t\) is \[ E[X(t)] = x_0 e^{\lambda t} \] Exponential growth!
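A minimal simulation sketch of the pure birth process (parameter values are arbitrary): simulate the exponential waiting times with state-dependent rate \(k\lambda\) and compare the mean population size to \(x_0 e^{\lambda t}\).

```r
# Pure birth process: when X(t) = k, the next birth comes after Exponential(k * lam)
set.seed(2)
lam <- 1; tmax <- 1; x0 <- 1; nsim <- 10000
birth <- function() {
  k <- x0; tt <- 0
  repeat {
    tt <- tt + rexp(1, rate = k * lam)
    if (tt > tmax) return(k)
    k <- k + 1
  }
}
m <- mean(replicate(nsim, birth()))
m   # close to x0 * exp(lam * tmax) = 2.72
```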


Pure death process

Consider a counting process \(X(t)\) and suppose the population has \(X(0)=x_0\) individuals initially. When \(X(t)=k\), the population loses one with rate \[ \mu_k = k\mu \] The expected number of individuals at time \(t\) is \[ E[X(t)] = x_0 e^{-\mu t} \] Exponential decay!


Birth-death process

Consider a counting process \(X(t)\). When \(X(t)=k\), births happen with rate \(\lambda_k = k\lambda\) and deaths with rate \(\mu_k = k\mu\).

The expected number of individuals at time \(t\) is

\[ E[X(t)] = x_0 e^{(\lambda-\mu) t} \]

So we see exponential growth or decay, depending on the relationship of \(\lambda\) and \(\mu\).


Birth-death immigration/emigration

Consider a counting process \(X(t)\). When \(X(t)=k\), births happen with rate \(k\lambda\) and immigrations with constant rate \(\alpha\). Deaths happen with rate \(k\mu\) and emigrations with rate \(\delta\).

\[\lambda_k = \alpha + k\lambda \] \[\mu_k = \delta + k\mu \]

The expectations for this model are complicated.


Hawkes self-exciting Poisson process

A useful class of self-exciting process is given by the Hawkes process:

\[ \lambda(t) = \alpha(t) + \sum_{i=1}^{N_t} M_i e^{-\gamma (t-t_i)} \]

The rate of new events is a function of the events that have already occurred. Each new event at time \(t_i\) increases the rate of new events by \(M_i\), and this contribution decays exponentially at rate \(\gamma\) thereafter.
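A small helper makes the formula concrete (the baseline \(\alpha\), jump sizes \(M_i\), decay rate \(\gamma\), and event times below are all hypothetical):

```r
# Hawkes intensity at time t given past event times
hawkes_intensity <- function(t, events, alpha = 0.5, M = 1, gamma = 1) {
  past <- events[events <= t]
  alpha + sum(M * exp(-gamma * (t - past)))
}
hawkes_intensity(3, events = c(1, 2))   # 0.5 + exp(-2) + exp(-1)
```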


Queues

A queue is a model for a list of individuals waiting to be served. Examples:

  • Emergency room
  • HIV care pipeline

Let \(X(t)\) be the number of individuals waiting. In the most basic queue model, arrivals happen as a Poisson process with rate \(\lambda\). Individuals are served, and exit the queue, with rate \(\mu\).


Example of a queueing model

Gonsalves et al. (2017)


Simulating stochastic population models

Gillespie algorithm

The “Gillespie algorithm” performs forward simulation of a stochastic compartmental model (Pineda-Krch 2008). You don’t really need to know how it works: forward simulation of continuous-time Markov models is computationally straightforward.

The ssa function in the GillespieSSA package implements a fast version of the Gillespie algorithm.


Example: pure birth

library(GillespieSSA)

parms <- c(lam=1)
x0 <- c(X=1)
a <- c("lam*X")
nu <- matrix(1)
out <- ssa(x0,a,nu,parms,tf=10,simName="Pure birth") 
plot(out$data[,1],out$data[,2],col="red",cex=0.5,pch=19, xlab="Time", ylab="Number")


Explanation of ssa

ssa(x0, # initial state vector
    a, # propensity vector
    nu, # state-change matrix
    parms, # model parameters
    tf # final time
        )
head(out$data)
##                      X
## timeSeries 0.0000000 1
##            0.1668619 2
##            0.5045949 3
##            0.5359663 4
##            0.8149916 5
##            0.8594286 6

Example: Logistic growth

parms <- c(b=2, d=1, K=1000)
x0 <- c(N=500)
a <- c("b*N", "(d+(b-d)*N/K)*N")
nu <- matrix(c(+1,-1),ncol=2)
out <- ssa(x0,a,nu,parms,tf=10,simName="Logistic growth")
plot(out$data[,1], out$data[,2], type="l", xlab="Time", ylab="Number")
abline(h=1000, lty="dashed", col="gray")


Example: SIR

parms <- c(beta=0.001, gamma=0.1)
x0 <- c(S=499,I=1,R=0)
a <- c("beta*S*I","gamma*I")
nu <- matrix(c(-1,0,+1,-1,0,+1),nrow=3,byrow=TRUE)
out <- ssa(x0,a,nu,parms,tf=100,simName="SIR model")
#ssa.plot(out)
plot(out$data[,1], out$data[,2], col="green", ylim=c(0,500), type="l", xlab="Time", ylab="Number")
lines(out$data[,1], out$data[,3], col="red") 
lines(out$data[,1], out$data[,4], col="blue")
legend(50,500,c("S", "I", "R"), pch=16, col=c("green", "red", "blue"))


References

Gonsalves, Gregg S, A David Paltiel, Paul D Cleary, Michael J Gill, Mari M Kitahata, Peter F Rebeiro, Michael J Silverberg, et al. 2017. “A Flow-Based Model of the HIV Care Continuum in the United States.” Journal of Acquired Immune Deficiency Syndromes 75 (5): 548–53.

Pineda-Krch, Mario. 2008. “GillespieSSA: Implementing the Stochastic Simulation Algorithm in R.” Journal of Statistical Software 25 (12): 1–18.